2024-10-08
Contingency Tables (Sections 13-14 in Course Book)
Suppose now that we are investigating factors affecting tuberculosis prevalence among intravenous drug users.
What does the 2x2 table look like?
The 2x2 Table is…
| Group | TB+ | TB- |
|---|---|---|
| Share needles | 24 | 73 |
| Don't share | 28 | 133 |
twobytwo (with Bayesian Augmentation)
To start, we’ll test the null hypothesis that the population proportions of intravenous drug users who have a positive tuberculin skin test result are identical for those who share needles and those who do not.
\[ H_0: \pi_{share} = \pi_{donotshare} \\ H_A: \pi_{share} \neq \pi_{donotshare} \]
We’ll use the Bayesian augmentation: add 2 to each cell of the 2x2 table before running the analysis.
twobytwo results (95% CI)

twobytwo(24+2, 73+2, 28+2, 133+2,
         "Sharing", "Not Sharing",
         "TB test+", "TB test-", conf.level = 0.95)

2 by 2 table analysis:
------------------------------------------------------
Outcome : TB test+
Comparing : Sharing vs. Not Sharing
TB test+ TB test- P(TB test+) 95% conf. interval
Sharing 26 75 0.2574 0.1816 0.3513
Not Sharing 30 135 0.1818 0.1301 0.2482
95% conf. interval
Relative Risk: 1.4158 0.8910 2.2498
Sample Odds Ratio: 1.5600 0.8594 2.8318
Conditional MLE Odds Ratio: 1.5572 0.8189 2.9511
Probability difference: 0.0756 -0.0244 0.1819
Exact P-value: 0.1638
Asymptotic P-value: 0.1438
------------------------------------------------------
Repeating the same call with conf.level = 0.90:

2 by 2 table analysis:
------------------------------------------------------
Outcome : TB test+
Comparing : Sharing vs. Not Sharing
TB test+ TB test- P(TB test+) 90% conf. interval
Sharing 26 75 0.2574 0.1925 0.3351
Not Sharing 30 135 0.1818 0.1375 0.2365
90% conf. interval
Relative Risk: 1.4158 0.9599 2.0884
Sample Odds Ratio: 1.5600 0.9458 2.5729
Conditional MLE Odds Ratio: 1.5572 0.9023 2.6818
Probability difference: 0.0756 -0.0088 0.1646
Exact P-value: 0.1638
Asymptotic P-value: 0.1438
------------------------------------------------------
This table displays the count of patients who show complete, partial, or no response after treatment with either active medication or a placebo in a study of 100 patients…
| Group | None | Partial | Complete |
|---|---|---|---|
| Active | 8 | 24 | 20 |
| Placebo | 12 | 26 | 10 |
Is there a meaningful association here?
The Pearson \(\chi^2\) test assumes the null hypothesis is true (that the rows and columns are independent). That is a model for our data. How does it work?
Here’s the table, with marginal totals added.
| – | None | Partial | Complete | TOTAL |
|---|---|---|---|---|
| Active | 8 | 24 | 20 | 52 |
| Placebo | 12 | 26 | 10 | 48 |
| TOTAL | 20 | 50 | 30 | 100 |
The test needs to estimate the expected frequency in each of the six cells under the assumption of independence. If the rows and columns were independent, what is the expected count in the Active/None cell?
| – | None | Partial | Complete | TOTAL |
|---|---|---|---|---|
| Active | – | – | – | 52 |
| Placebo | – | – | – | 48 |
| TOTAL | 20 | 50 | 30 | 100 |
If the rows and columns were independent, then the probability of landing in the Active/None cell would be P(Active) × P(None) = (52/100) × (20/100), so the expected count there is 100 × 0.52 × 0.20 = 10.4.
So, can we fill in the expected frequencies under our independence model?
| – | None | Partial | Complete | TOTAL |
|---|---|---|---|---|
| Active | 8 (10.4) | 24 (26.0) | 20 (15.6) | 52 |
| Placebo | 12 (9.6) | 26 (24.0) | 10 (14.4) | 48 |
| TOTAL | 20 | 50 | 30 | 100 |
\[ \mbox{Expected Frequency} = \frac{\mbox{Row total} \times \mbox{Column total}}{\mbox{Grand Total}} \]
This assumes that the independence model holds: the probability of being in a particular column is exactly the same in each row, and vice versa.
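This row-total × column-total arithmetic is easy to verify directly in R; here is a minimal sketch (the variable names are mine, not from the slides):

```r
# Marginal totals from the table above
row_tot <- c(Active = 52, Placebo = 48)
col_tot <- c(None = 20, Partial = 50, Complete = 30)

# Expected counts under independence: (row total x column total) / grand total
exp_counts <- outer(row_tot, col_tot) / 100
exp_counts
```

The result reproduces the parenthesized values above: 10.4, 26.0, 15.6 in the Active row and 9.6, 24.0, 14.4 in the Placebo row.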
We’ll put the table into a matrix in R. Here’s one approach…
What is the conclusion?
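Here is one such approach, as a sketch; the object name T1 matches the name that appears in the Fisher output below:

```r
# Enter the observed counts, one row per treatment group
T1 <- matrix(c(8, 24, 20,
               12, 26, 10),
             byrow = TRUE, nrow = 2,
             dimnames = list(c("Active", "Placebo"),
                             c("None", "Partial", "Complete")))

res <- chisq.test(T1)
res   # X-squared is about 4.06 on 2 df, p is about 0.13
```

With p above 0.05, these data do not provide strong evidence of an association between treatment group and response.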
Could we use Fisher’s exact test instead? Yes, but if the Pearson assumptions hold, then Fisher’s test is not generally an improvement.
fisher.test(T1)

Fisher's Exact Test for Count Data
data: T1
p-value = 0.1358
alternative hypothesis: two.sided
dm1 <- read_csv("c13/data/dm464_class12.csv", show_col_types = FALSE) |>
janitor::clean_names() |>
mutate(tobacco = fct_relevel(tobacco, "Current", "Former"),
insurance = fct_relevel(insurance, "Medicare",
"Commercial", "Medicaid"))
dm1 |> tabyl(tobacco, insurance) |>
  adorn_totals(where = c("row", "col"))

 tobacco Medicare Commercial Medicaid Total
Current 33 8 76 117
Former 92 23 71 186
Never 68 30 63 161
Total 193 61 210 464
Pearson \(\chi^2\) results for the dm1 data?
Pearson's Chi-squared test
data: tabyl(dm1, insurance, tobacco)
X-squared = 28.574, df = 4, p-value = 9.542e-06
Can we check our expected frequencies?
Observed counts:

 insurance Current Former Never
Medicare 33 92 68
Commercial 8 23 30
Medicaid 76 71 63
Expected counts:

 insurance Current Former Never
Medicare 48.66595 77.36638 66.96767
Commercial 15.38147 24.45259 21.16595
Medicaid 52.95259 84.18103 72.86638
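These are simply the observed and expected components returned by `chisq.test()`; here is a self-contained sketch that re-enters the counts as a matrix:

```r
# Observed counts from the dm1 table (insurance rows, tobacco columns)
obs <- matrix(c(33, 92, 68,
                8, 23, 30,
                76, 71, 63),
              byrow = TRUE, nrow = 3,
              dimnames = list(insurance = c("Medicare", "Commercial", "Medicaid"),
                              tobacco = c("Current", "Former", "Never")))

res <- chisq.test(obs)
res$expected   # e.g., 193 * 117 / 464 = 48.666 for Medicare / Current
```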
Any problems with the Cochran conditions? No: every expected frequency is at least 5 (the smallest is about 15.4), and no expected count falls below 1.
Each rectangle’s area is proportional to the number of cases in that cell.
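That description fits a mosaic plot; here is a minimal sketch (the counts are re-entered so this chunk stands alone):

```r
# Observed counts for the dm1 table (insurance rows, tobacco columns)
counts <- matrix(c(33, 92, 68,
                   8, 23, 30,
                   76, 71, 63),
                 byrow = TRUE, nrow = 3,
                 dimnames = list(insurance = c("Medicare", "Commercial", "Medicaid"),
                                 tobacco = c("Current", "Former", "Never")))

# Tile areas are proportional to the cell counts
mosaicplot(counts, main = "Insurance by Tobacco Status")
```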
In the old days, I used to teach…
\[ H_0: \mu_1 = \mu_2 \mbox{ vs. } H_A: \mu_1 \neq \mu_2 \]
In the old days, let \(\delta\) be the (true) mean of the paired differences.
\[ H_0: \delta = 0 \mbox{ vs. } H_A: \delta \neq 0 \]
In the old days, I used to teach…
If the 95% confidence interval doesn’t include the null hypothesized value, then the \(p\) value is below 0.05, and again…
I also spent meaningful time on making your work look like everyone else’s…
The idea of a p-value as one possible summary of evidence morphed into a

- rule for authors: reject the null hypothesis if p < .05, which morphed into a
- rule for editors: reject the submitted article if p > .05, which morphed into a
- rule for journals: reject all articles that report p-values.
Bottom line: Reject rules. Ideas matter.
2016
2019
… a world learning to venture beyond “p < 0.05”
This is a world where researchers are free to treat “p = 0.051” and “p = 0.049” as not being categorically different, where authors no longer find themselves constrained to selectively publish their results based on a single magic number.
In this world, where studies with “p < 0.05” and studies with “p > 0.05” are not automatically in conflict, researchers will see their results more easily replicated – and, even when not, they will better understand why.
The 2016 ASA Statement on P-Values and Statistical Significance started moving us toward this world. As of the date of publication of this special issue, the statement has been viewed over 294,000 times and cited over 1700 times, an average of about 11 citations per week since its release. Now we must go further.
The ASA Statement (2016) was mostly about what not to do.
The 2019 effort represents an attempt to explain what to do.
Some of you exploring this special issue of The American Statistician might be wondering if it’s a scolding from pedantic statisticians lecturing you about what not to do with p-values, without offering any real ideas of what to do about the very hard problem of separating signal from noise in data and making decisions under uncertainty. Fear not. In this issue, thanks to 43 innovative and thought-provoking papers from forward-looking statisticians, help is on the way.
If you’re just arriving to the debate, here’s a sampling of what not to do.
http://jamanetwork.com/journals/jamaotolaryngology/fullarticle/2546529
A label of statistical significance adds nothing to what is already conveyed by the value of p; in fact, this dichotomization of p-values makes matters worse.
The common practice of dividing data comparisons into categories based on significance levels is terrible, but it happens all the time… so it’s worth examining the prevalence of this error. Consider, for example, dividing results into really significant, significant, marginally significant, and not at all significant:
Now consider some typical p-values in these ranges: say, p = .005, p = .03, p = .08, and p = .2.
Translate these two-sided p-values back into z-scores…
| Description | really sig. | sig. | marginally sig. | not at all sig. |
|---|---|---|---|---|
| p value | 0.005 | 0.03 | 0.08 | 0.20 |
| Z score | 2.8 | 2.2 | 1.8 | 1.3 |
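The translation uses the standard normal quantile function; a quick sketch:

```r
p <- c(0.005, 0.03, 0.08, 0.20)
z <- qnorm(1 - p / 2)   # two-sided: the upper tail holds p/2
round(z, 1)             # 2.8 2.2 1.8 1.3
```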
The seemingly yawning gap in p-values, from the not at all significant p = .2 to the really significant p = .005, corresponds to a difference in z-scores of only 1.5.
If you had two independent experiments with z-scores of 2.8 and 1.3 and with equal standard errors and you wanted to compare them, you’d get a difference of 1.5 with a standard error of 1.4, which is completely consistent with noise.
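The arithmetic behind that claim, as a sketch:

```r
z1 <- 2.8; z2 <- 1.3           # z-scores of the two experiments, each with SE = 1
diff <- z1 - z2                # 1.5
se_diff <- sqrt(1^2 + 1^2)     # SE of a difference of independent estimates, ~1.41
diff / se_diff                 # ~1.06, well below 1.96: consistent with noise
```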
From a statistical point of view, the trouble with using the p-value as a data summary is that the p-value can only be interpreted in the context of the null hypothesis of zero effect, and (much of the time), nobody’s interested in the null hypothesis.
Indeed, once you see comparisons between large, marginal, and small effects, the null hypothesis is irrelevant, as you want to be comparing effect sizes.
From a psychological point of view, the trouble with using the p-value as a data summary is that this is a kind of deterministic thinking, an attempt to convert real uncertainty into firm statements that are just not possible (or, as we would say now, just not replicable).
The key point: The difference between statistically significant and NOT statistically significant is not, generally, statistically significant.
If I have to boil it down to one thing, it’s not that p values or confidence intervals are inherently bad data summaries, although their interpretation is by no means straightforward.
It’s that the whole notion of statistical significance is the problem.
next time…
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)
Locale:
LC_COLLATE=English_United States.utf8
LC_CTYPE=English_United States.utf8
LC_MONETARY=English_United States.utf8
LC_NUMERIC=C
LC_TIME=English_United States.utf8
Package version:
askpass_1.2.0 backports_1.5.0 base64enc_0.1.3
bayestestR_0.14.0 bit_4.0.5 bit64_4.0.5
blob_1.2.4 broom_1.0.6 bslib_0.8.0
cachem_1.1.0 callr_3.7.6 cellranger_1.1.0
cli_3.6.3 clipr_0.8.0 cmprsk_2.2-12
coda_0.19-4.1 codetools_0.2-20 colorspace_2.1-1
compiler_4.4.1 conflicted_1.2.0 correlation_0.8.5
cpp11_0.5.0 crayon_1.5.3 curl_5.2.2
data.table_1.16.0 datasets_4.4.1 datawizard_0.12.3
DBI_1.2.3 dbplyr_2.5.0 digest_0.6.37
dplyr_1.1.4 dtplyr_1.3.1 easystats_0.7.3
effectsize_0.8.9 emmeans_1.10.4 Epi_2.55
estimability_1.5.1 etm_1.1.1 evaluate_1.0.0
fansi_1.0.6 farver_2.1.2 fastmap_1.2.0
fontawesome_0.5.2 forcats_1.0.0 fs_1.6.4
gargle_1.5.2 generics_0.1.3 ggplot2_3.5.1
glue_1.7.0 googledrive_2.1.1 googlesheets4_1.1.1
graphics_4.4.1 grDevices_4.4.1 grid_4.4.1
gtable_0.3.5 haven_2.5.4 highr_0.11
hms_1.1.3 htmltools_0.5.8.1 httr_1.4.7
ids_1.0.1 insight_0.20.4 isoband_0.2.7
janitor_2.2.0 jquerylib_0.1.4 jsonlite_1.8.9
knitr_1.48 labeling_0.4.3 lattice_0.22-6
lifecycle_1.0.4 lubridate_1.9.3 magrittr_2.0.3
MASS_7.3-61 Matrix_1.7-0 memoise_2.0.1
methods_4.4.1 mgcv_1.9-1 mime_0.12
modelbased_0.8.8 modelr_0.1.11 multcomp_1.4-26
munsell_0.5.1 mvtnorm_1.3-1 nlme_3.1-164
numDeriv_2016.8-1.1 openssl_2.2.1 parallel_4.4.1
parameters_0.22.2 patchwork_1.3.0 performance_0.12.3
pillar_1.9.0 pkgconfig_2.0.3 plyr_1.8.9
prettyunits_1.2.0 processx_3.8.4 progress_1.2.3
ps_1.8.0 purrr_1.0.2 R6_2.5.1
ragg_1.3.2 rappdirs_0.3.3 RColorBrewer_1.1.3
Rcpp_1.0.13 RcppArmadillo_14.0.2.1 readr_2.1.5
readxl_1.4.3 rematch_2.0.0 rematch2_2.1.2
report_0.5.9 reprex_2.1.1 rlang_1.1.4
rmarkdown_2.28 rstudioapi_0.16.0 rvest_1.0.4
sandwich_3.1-1 sass_0.4.9 scales_1.3.0
see_0.9.0 selectr_0.4.2 snakecase_0.11.1
splines_4.4.1 stats_4.4.1 stringi_1.8.4
stringr_1.5.1 survival_3.7-0 sys_3.4.2
systemfonts_1.1.0 textshaping_0.4.0 TH.data_1.1-2
tibble_3.2.1 tidyr_1.3.1 tidyselect_1.2.1
tidyverse_2.0.0 timechange_0.3.0 tinytex_0.53
tools_4.4.1 tzdb_0.4.0 utf8_1.2.4
utils_4.4.1 uuid_1.2.1 vctrs_0.6.5
viridisLite_0.4.2 vroom_1.6.5 withr_3.0.1
xfun_0.47 xml2_1.3.6 xtable_1.8-4
yaml_2.3.10 zoo_1.8-12
431 Class 13 | 2024-10-08 | https://thomaselove.github.io/431-2024/